Annealing schedule from population dynamics^ 
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We introduce a dynamical annealing schedule for population-based optimization algorithms with 
mutation. On the basis of a statistical mechanics formulation of the population dynamics, the 
mutation rate adapts to a value maximizing expected rewards at each time step. Thereby, the 
mutation rate is eliminated as a free parameter from the algorithm. 
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I. INTRODUCTION 

Population-based optimization algorithms p| -p| have 
been successfully applied to problems in physics Hpf and 
beyond 0] . This class of algorithms is based on the simul- 
taneous tracking of more than one point in search space 
(a "population" , in analogy to biological evolution [Q), in 
order to make trapping in local optima less likely during 
the process of optimization. In addition, stochastic noise 
is used to generate random displacements of the search 
points ("mutations"), performing a local optimization. 
A major problem in these algorithms is the adjustment 
of the noise level for a given optimization task. In the 
beginning of the search, high noise levels help to identify 
promising regions of the search space, while for a subse- 
quent fine tuning low noise works best. This problem is 
well known from simulated annealing Q , an optimization 
algorithm where noise is introduced by means of a formal 
temperature. Lowering, or "annealing" , the temperature 
from high to low values in the course of the optimization 
leads to improved results compared to an optimization 
at fixed temperatures. However, there remains the prob- 
lem of choosing a suitable annealing schedule for a given 
optimization problem j^j. The same problem occurs in 
population-based optimization algorithms and will be ad- 
dressed in the remainder of this paper. 

For some population-based algorithms heuristics have 
been proposed that adjust the noise rate during opti- 
mization. For example, when noise is implemented as 
random steps of fixed Euclidean distance in the search 
space, its step size can be adjusted according to an esti- 
mate for the most promising next step on the basis of the 
previous one ||. Another approach, taken by Davis 
uses a set of noise operators competing for high scores in 
producing low energy search points. Although such ap- 
proaches work in practice, they have not been formally 
established on the basis of a dynamical formulation of the 
algorithm. A major problem is the enormous complexity 
of the dynamics of population-based algorithms. 

Recently, this problem has attracted parts of the 
physics community, applying statistical mechanics meth- 



ods to the algorithm dynamics. Priigel-Bennett and 
Shapiro |jic|| described the average population dynam- 
ics in terms of the distribution of energy values in the 
population at each time step. The observables are the 
cumulants of the energy distribution of the population. 
Selection of the "fittest" (the low energy members) is 
completely defined in terms of these variables. For cer- 
tain energy functions, this enables a prediction of the 
algorithm dynamics to high accuracy over large numbers 
of generations |Tj|j . In Ref. |Q it was proposed to use 
this formalism in order to determine an annealing sched- 
ule from the predicted dynamics. However, each of the 
two immediate routes faces a major obstacle: The ana- 
lytical approach is only feasible for exactly known energy 
functions with simple properties and involves a compli- 
cated maximum entropy calculation. The alternative way 
via measuring the current cumulants during evolution is 
spoiled by large sample-to-sample fluctuations. Here we 
propose a model which, though inspired by this, does not 
have to deal with these problems. For an a priori un- 
known energy function (being the usual case when deal- 
ing with optimization problems) a less formal framework 
is needed. This is supplied by a dynamical model based 
on energy correlation formulated in Ref. jL3). It will be 
used here to predict an optimal noise rate by maximizing 
the expected performance of the algorithm in each time 
step. In the following we will define two test functions 
and two algorithms to be considered. A model for the 
prediction of the mutation effects will then be given. An 
improved algorithm will be defined on the basis of this 
model. It is then applied to the test functions and its per- 
formance compared to standard versions of the respective 
algorithms. 



II. TEST SYSTEM 

Let us consider an optimization problem in terms of 
the minimization of a real valued function E(S) on a bi- 
nary search space. Its value is the energy of the test 
point S which is to be minimized, equivalent to a search 
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for the ground state energy of a physical system p4[ . 
In biological terms this corresponds to the negative "fit- 
ness" of an organism to be maximized for survival. The 
discrete search space will be parametrized through the 
binary representation S G {±1}^ of length N. 

Two functions serve as examples: one purely additive 
energy function, and another with many local optima. 
The first problem is a random field paramagnet 
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E a = y j Ji s\ + /tj 



(i) 



with random couplings Ji taken from a Gaussian distri- 
bution with mean and variance 1 . The N spins Sf with 
i = 1, . . . , N and 5f = ±1 form the genetic string of the 
member a of the population. The second function is the 
NK model energy function |lf| as an example for a hard 
search problem with many local minima. It is defined 
through 
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with 2 K+1 random energy values Ei(S a ) drawn from a 
uniform distribution over the interval [0, 1] and a ran- 
domly chosen permutation of sites i% to %k , both for each 
i. Originally, this function has been formulated for the 
study of evolution on tunably rugged energy landscapes 
with application to the evolution of the immune response 
[p~6|| . These functions will be minimized by means of 
a population-based algorithm which is defined as fol- 
lows: First, a random ensemble of search points S a with 
a = 1, . . . , P and energies E a — E(S a ) is chosen (a "pop- 
ulation" ) . One time step of the algorithm consists of the 
following procedure: 

1. Select the member with the lowest energy. 

2. Reproduce it once. 

3. Replace the member with the highest energy by the 
new copy. 

4. "Mutate" all members except the original one with 
the lowest energy by inverting each spin with a 
small fixed probability 7. 

Repeating these steps then forms an evolutionary algo- 
rithm searching for low lying energy states. It is driven 
by selection lowering the mean energy of the population 
and mutation increasing the variance. 

For comparison let us also look at a simplified algo- 
rithm, a stochastic gradient descent. After the initial 
population is chosen as above, the following steps are 
taken: 

1. Select the member with the lowest energy. 



2. Reproduce it P — 1 times. 

3. "Mutate" the new copies by inverting each spin 
with a small fixed probability 7. 

4. Replace all members, except the one with the low- 
est energy, by the mutated copies. 

Here, the offspring of the best member takes over the 
entire population in each time step. 



III. MODELING THE ALGORITHM DYNAMICS 

We will model the dynamics of these algorithms in 
terms of the energy distribution p(E) of the population 
expressed as an expansion in cumulants. The energy dis- 
tribution p{E) of a population is the natural quantity 
for the selection operator, which solely acts on the en- 
ergy values of the search points. The expansion in cu- 
mulants of the energy distribution p(E) has been shown 
to be a useful approximation for population-based algo- 
rithms At each time step, the evolving population 
is then approximated by a set of these variables. When 
also modeling the mutation operator, one has to be more 
careful. Mutation, acting on the underlying represen- 
tation instead of the energy values themselves, requires 
additional assumptions to model its dynamics in terms 
of p(E). For example, one could obtain a maximum like- 
lihood estimation for the underlying spin states corre- 
sponding to a given energy cumulant, and use this to 
calculate the expected effect of the mutation operator. 
However, such a procedure requires simple energy func- 
tions to allow for the calculation, and, of course, a com- 
plete knowledge of the energy function, which is usually 
not available for realistic optimization problems. 

Here we use a different approach which is based solely 
on a phcnomcnological parameter that is accessible by 
measurement if the energy function is not known [fl3|| . 
In particular, we will use a model for the lowest order 
average dynamics of the mutation operator on the basis 
of the energy correlation m of mutation on a given energy 
landscape E(S), 
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where ( ) denotes an average over the population, and 
( ) t an average over all possible mutation events. The 
energy correlation m is a measure of how strongly, on 
average, the energy of a mutant is correlated to that of its 
parent, for a given mutation operator applied to a given 
energy function. Such correlations form the backbone on 
which the search process in mutation-based algorithms 
proceeds. Energy correlations can be measured for many 
hard optimization problems, as were recently classified in 
Ref. [la]. In this framework, the energy distribution of 



2 



a population after mutation can be approximated by its 
cumulants as a function of m, 

k™ = m Ki + (1 - to) k° 



k 2 + (1 - m 2 ) 



^4 
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(4) 



where and k° are the energy mean and variance of 
a random initial population. This model was derived in 
Ref. p3| . The underlying assumption of this model is 
that the population of the algorithm (not the landscape 
itself) can be expanded in cumulants around a Gaussian. 
In fact, one observes that the initial random population 
for many real optimization problems fulfills this require- 
ment well. In the framework of this expansion, the above 
model uses m to predict the expected energy distribution 
of the population in the next time step. Such a model, 
based on energy correlations, further assumes that the 
lowest order correlation m contains major information 
about the average effect of the mutation operator. It has 
been shown to be useful to describe the dynamics of a 
population-based algorithm over at least 200 generations, 
both for correlated and poorly correlated landscapes jl3| . 

How can such a model be used to improve an opti- 
mization algorithm? Let us look at a numerical example 
for the dynamics of a stochastic gradient descent under a 
fixed mutation rate 7, as shown in Fig. [l]. Optimization 
of the first test function (|l|) is shown with a stochastic 
gradient descent, searching for the minimal energy con- 
figuration of a random paramagnet of N — 128 spins in 
an external field. For a large mutation rate 7 one sees 
that the early gain is large, whereas for small 7, as shown 
by the solid curve, a poor early gain is balanced later by 
a slow but steady improvement. For optimization prob- 
lems involving computationally costly energy evaluation, 
this behavior poses a severe problem. Knowledge of the 
latter stages of the dynamics would be needed at the be- 
ginning in order to be able to choose an optimal 7. In the 
following, this problem will be addressed through a vari- 
able mutation rate j(t), that combines the advantages of 
both regimes of the mutation rate 7. 



IV. ANNEALING THE MUTATION RATE 

For this purpose, the expected best member of a pop- 
ulation after mutation (£ m ; n ) is evaluated on the basis 
of the energy distribution p m {E) of the population after 
mutation given in terms of cumulants k™. The expecta- 
tion value for the lowest energy occurring in a set of P 
samples p"^ ] drawn from the post-mutation distribution 
p m (E) is 
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(E min ) = P J dBi Ex p m {E 1 ) n f dE n p m (E n ). 

(5) 

In the Gaussian approximation a saddle point expansion 
yields, to leading order, 
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(6) 



Inserting the post-mutation distribution (Ji]), the ex- 
pected energy of the best member after mutation (E m - ln ) 
can be minimized in terms of m. The resulting mutation 
correlation m op t is then used to choose the mutation rate 
7 in the forthcoming mutation step, thereby optimizing 
the expected best member of the next generation. Un- 
fortunately, this method is plagued by large fluctuations 
in the measured moments of the energy distribution. 

Therefore, let us first look at the expected dynamics of 
the stochastic gradient descent where this problem does 
not occur. Following Eq. (Q), the energy distribution 
after the mutation step is given in the Gaussian approx- 
imation by 



K ? = m E min (t) + (1 - m) *5 
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(7) 



Inserting this into Eq. (^), and minimizing the expected 
best member of the next generation (E m - m (t + 1)) with 
respect to m, yields an estimate for an optimal correla- 
tion: 
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(E min (t) - n\f 



\ 2k 2 ln(P) + (E min (t) - Kif 



(8) 



This is subsequently translated into an optimal mutation 
rate 7 opt via 



l-2 7 , 



(9) 



which is derived from the known energy function. Each 
time step of the modified algorithm can now be described 
as follows: 

1. Determine the lowest energy E m i n in the popula- 
tion. 

2. Calculate the optimal correlation m opt from Eq. (|^) 
and calculate the mutation rate 7 op t from this. 

3. Select the member with the lowest energy. 

4. Reproduce it P — 1 times. 

5. "Mutate" the new copies by inverting each spin 
with the mutation rate 7opt- 

6. Replace all members, except the one with the low- 
est energy, by the mutated copies. 
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Starting from an initial condition as above and iterating 
this step results in an algorithm with an adaptive mu- 
tation rate. How it applies to the above test function 
is shown in Fig. |l|. At each time scale, the evolution 
of the lowest energy member of the evolving population 
compares well to the respective best "fixed mutation rate 
algorithm" . No explicit knowledge of favorable ranges of 
the mutation rate 7 is used, thus removing the free pa- 
rameter 7 from the algorithm. Applying the formalism 
to the NK-model function (Q) , and using the relation be- 
tween parent child correlation m and mutation rate 7, 
derived as 



m = (1 — 7) 



K+l 



(10) 



a comparable result is obtained (Fig. ||) . 

A similar procedure can also be carried out for the 
full population-based algorithm with sparse replication. 
Again, Eq. (|J) is used to adjust the mutation rate in the 
next generation to a value that maximizes the expected 
gain. In order to avoid large fluctuations, which would 
be incompatible with a smooth evolution, we do not base 
the prediction on the cumulants of the current energy 
distribution in the population, but rather on E mm alone. 
This is done in the spirit of Eq. (Q), which is less likely 
to fluctuate than the prediction based on the full cumu- 
lants Ki. However, E m { n still relates to the dynamics of 
a mixed population, and proves to be useful in model- 
ing the population dynamics under mutation. Depend- 
ing on the mutation stregth, a number of former mutants 
are still correlated with the new offspring, in addition to 
the one copy of -Emm made per generation. Let us as- 
sume that a number of M members of the population 
are strongly correlated with the new offspring. For sim- 
plicity we further assume that the remaining members 
are completely uncorrelated and treat them as random. 
In this approximation, the integral for the expected best 
member of a population can be written as 



(E min )=M J dE 1 E lP m {E 1 ) 



dE 2 p m (E 2 ) 



Lfc'i 



M-l 



dE 3 p°(E 3 ) 



P-l-M 



00 

-(P-l-M) J dE 1 E 1 p°(E 1 ) 



dE 2 p m {E 2 ) 



Lfc'i 



dE 3 p°(E 3 ) 



Lfc'i 



P-2-M 



(11) 



It is solved using a saddle point expansion in the Gaus- 
sian approximation, considering the limit where the dis- 
tributions p m and p° move sufficiently apart from each 



other (due to E m { n moving away from the random popu- 
lation distribution), where one can neglect their mutual 
variations. One obtains 



{E n 



2n 2 n \n{M - 1) 
2k° ln(P-2-M). 



(12) 



The expected (E m i n (t +1)) of the next generation based 
on Eq. (^) is then minimized by the mutation rate 
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(13) 



Finally, the number of correlated members M in the pop- 
ulation remains to be specified. For a lowest order esti- 
mate let us consider a member with energy E m - m and 
mutate it k times. We then require that its energy does 
not, on average, move away more than y^K™ from the 
current value of E nl [ ni i.e., 



E min + V2rtf > m k E min + (1 - m k ) k\. (14) 

The exact limit for the number of subsequent mutations 
k depends on the current details of the energy values 
in the population. However, when using Eq. dl3| ) as an 
estimate for the current value of m, the energy value of 
a mutant decorrelates after only a few mutation steps. 
Therefore, ln(M — 1) is estimated to be of the order of 
1 and we determine the optimal mutation rate in the 
algorithm using 



^opt 



/ (E min (t) - 

2k° 2 + (E min (t) - /s°)2 ' 



(15) 



This expression is now used for annealing the mutation 
rate in the population based algorithm. The modified 
time step of the algorithm is defined by the following 
procedure: 

1. Determine the lowest energy J? m i n in the popula- 
tion. 

2. Calculate the optimal correlation m opt from Eq. 
(p"5|), and calculate the mutation rate 7 op t from it. 

3. Select the member with the lowest energy. 

4. Reproduce it once. 

5. Replace the member with the highest energy by the 
new copy. 

6. "Mutate" all members except the original one with 
the lowest energy by inverting each spin with the 
probability 7 opt . 
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Again starting from an initial condition as above and it- 
erating this step results in an algorithm with annealed 
mutation rate. In Fig. || the evolution of the best pop- 
ulation member on the basis of this algorithm is com- 
pared to runs with fixed mutation rates. The algorithm 
adjusting the mutation rate compares well to the fixed 
mutation rate cases at each stage of evolution. In Fig. |] 
the algorithm is applied to the NK-model function with 
similar results. For any given resource of CPU time, one 
reaches a level of performance comparable to an optimum 
fixed mutation rate (at the given total evolution time). 
This is helpful in optimization when the relationship be- 
tween mutation rate 7 and the algorithm dynamics at 
later times is a priori unknown. 

V. DISCUSSION 

For both algorithms considered above, we have seen 
how annealing the mutation rate can be based on a sim- 
ple dynamical model based on the energy correlation of 
the mutation operator. In the presented examples, func- 
tions with known analytical properties have been con- 
sidered, enabling a direct calculation of the mutation 
correlation 771(7). However, when applying the above 
method to general optimization problems, this functional 
dependence remains to be established. For many real- 
istic optimization problems it is well approximated by 
a monotonic function with a simple decay law in the 
small 7 regime, as classified in Ref. [18] for a number 
of different optimization problems. For many problems 
it can be modeled using the simple linear approximation 
7(777) = I — x to. In order to apply the above algorithms 
to optimization problems where the energy function is 
not known, a heuristics that measures this relation for 
a given problem has been defined. One possibility is to 
measure m and improve the estimate for x during the run 
of the above algorithms. This procedure can be defined 
as follows: 

1. Start from an initial estimate for x. 

2. Measure the mutation correlation m during each 
time step of the algorithm using ([|) . 

3. Use the measured m to improve the estimate for x 
in the linear approximation (taken as the average 
over all measured values of x so far). 

This allows one to apply the method to energy functions 
with no a priori knowledge of their correlation structure. 
This method has been sucessfully tested using the two 
energy functions of this study. 

Several extensions remain to be studied, e.g., algo- 
rithms where recombination, or "crossover", is present. 
In such algorithms, the annealed mutation as described 
here is expected to work equally well as long as the mu- 
tation step does not strongly interact with the crossover. 



Whether the recombination strength can be adapted in 
a similar way is an open question. Another free parame- 
ter is introduced by selection, namely, selection strength. 
Here a one parameter model exists [ flO| , and an adaptive 
adjustment could be discussed as well. 

To summarize, we proposed a mechanism for annealing 
the mutation rate in population-based algorithms. It is 
based on a statistical mechanics model of the population 
dynamics and a correlation measure of the mutation op- 
erator. The mutation rate 7 thereby drops out as a free 
parameter of the algorithm. 
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FIGURES 



FIG. 1. The evolution of the member with maximum fit- 
ness / = — -E m in is shown at different fixed mutation rates 7 
of a stochastic gradient descent for the random paramagnet. 
In comparison, the points show the dynamics of the adaptive 
mutation algorithm. In all simulations a quenched average 
over 200 runs is shown, with a random energy function cho- 
sen once. The dotted line denotes the global optimum of the 
function. 

FIG. 2. Same as Fig. 1 for an evolution on the rugged land- 
scape of the NK-model energy function with K = 8. 

FIG. 3. Adaptive mutation in the population-based algo- 
rithm compared to the fixed mutation case for a random para- 
magnet, with conventions chosen as in the previous figures. 
The dotted line denotes the global optimum of the function. 

FIG. 4. Adaptive mutation in the population-based algo- 
rithm compared to a fixed mutation rate for the NK model. 
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